Reasoning about Record Matching Rules

Authors

  • Wenfei Fan
  • Xibei Jia
  • Jianzhong Li
  • Shuai Ma
Abstract

To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.
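To make point (b) concrete, here is a minimal Python sketch of how a key in the spirit of an RCK might be applied: a list of attribute pairs, each paired with the similarity test used to compare them, where two records match only if every comparison holds. This is an illustration under stated assumptions, not the paper's formal definition or its algorithms. The attribute names, the sample records, the edit-distance threshold, and the helpers `within_edits` and `matches` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Similarity = Callable[[str, str], bool]
# One comparison: (attribute in relation 1, attribute in relation 2, similarity test).
Comparison = Tuple[str, str, Similarity]


def equal(a: str, b: str) -> bool:
    """Strict equality, the strongest similarity test."""
    return a == b


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def within_edits(max_edits: int) -> Similarity:
    """Build a similarity test: case-insensitive edit distance <= max_edits."""
    def test(a: str, b: str) -> bool:
        return edit_distance(a.lower(), b.lower()) <= max_edits
    return test


def matches(key: List[Comparison], r1: Record, r2: Record) -> bool:
    """Two records match under `key` if every listed comparison holds."""
    return all(sim(r1[a1], r2[a2]) for a1, a2, sim in key)


# Hypothetical key: last name and phone must agree exactly,
# while the first name may differ by at most two edits.
key_by_phone: List[Comparison] = [
    ("last_name", "lname", equal),
    ("phone", "tel", equal),
    ("first_name", "fname", within_edits(2)),
]

card = {"first_name": "Mark", "last_name": "Clifford", "phone": "908-1111111"}
bill = {"fname": "Marx", "lname": "Clifford", "tel": "908-1111111"}
other = {"fname": "Mark", "lname": "Clifford", "tel": "908-2222222"}

same_person = matches(key_by_phone, card, bill)
different_person = matches(key_by_phone, card, other)
```

Here the misspelled first name in `bill` does not prevent a match, because the key leans on the more reliable phone and last-name attributes. That is the intuition behind point (c), although the inference mechanism itself is not modeled in this sketch.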

Similar resources

Mining fuzzy association rules for web access case adaptation

Web access path prediction using knowledge discovered from web logs has become an active research area. Web logs provide updated information about the user’s access record to a web site, which contains useful patterns waiting to be discovered and used for improving the web site. In this study, a new approach to web access pattern prediction is proposed. The methodology is based on the case-base...

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

FROM SIMPLE ASSOCIATIONS TO SYSTEMATIC REASONING: A Connectionist representation of rules, variables, and dynamic bindings using temporal synchrony

Human agents draw a variety of inferences effortlessly, spontaneously, and with remarkable efficiency — as though these inferences are a reflex response of their cognitive apparatus. Furthermore, these inferences are drawn with reference to a large body of background knowledge. This remarkable human ability seems paradoxical given the results about the complexity of reasoning reported by resear...

Qualitative spatial reasoning for soccer pass prediction

Given the advances in camera-based tracking systems, many soccer teams are able to record data about the players’ position during a game. Analysing these data is challenging, since they are fine-grained, contain implicit relational information between players, and contain the dynamics of the game. We propose the use of qualitative spatial reasoning techniques to address these challenges, and te...

Entity Identification in Database Integration: An Evidential Reasoning Approach

Entity identification is the problem of matching object instances from different databases which correspond to the same real-world entity. In this paper, we present a 2-step entity identification process in which attributes for matching tuples may be missing in certain tuples, and thus need to be derived prior to the matching. To match tuples, we require identity rules which specify the conditions...

Journal:
  • PVLDB

Volume 2

Publication date: 2009